In [8]:
%%HTML
<script src = "require.js" > </script >
In [3]:
from IPython.display import HTML, display, display_html

HTML("""<style>
.output_png {
    display: table-cell;
    text-align: center;
    vertical-align: middle;
}
</style>
""")

HTML('''<script src="https://cdnjs.cloudflare.com/ajax/libs/jquery/2.0.3/jquery.min.js "></script><script>
code_show=true; 
function code_toggle() {
if (code_show){
$('div.jp-CodeCell > div.jp-Cell-inputWrapper').hide();
} else {
$('div.jp-CodeCell > div.jp-Cell-inputWrapper').show();
}
code_show = !code_show
} 
$( document ).ready(code_toggle);</script><form action="javascript:code_toggle()"><input type="submit" value="Click here to toggle on/off the raw code."></form>
''')
Out[3]:
In [7]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import pyspark.pandas as ps
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

import plotly.graph_objects as go

from scipy.stats import (
    norm, t, kstest, mannwhitneyu, levene, kruskal, ks_2samp
)

import plotly
import joblib
import json
import warnings
warnings.filterwarnings("ignore")

Executive Summary

  Dota 2, a popular multiplayer online battle arena game, offers a rich dataset for analysis due to its dynamic gameplay and complex mechanics. This study aimed to gain insights into the game's mechanics and match outcomes by exploring matches data. We focused on answering the following research questions:

  • What is the distribution of matches across game modes and which mode is most popular?
  • What is the typical match duration, and are there any patterns or outliers?
  • Is there a difference in match duration based on the winning side (Radiant or Dire)?
  • Does the timing of the first kill (first blood) impact the duration of the match?
  • What is the usual status of towers and barracks at the end of a match?

  Our approach used big data techniques to handle over 16 GB of match data. We extracted the data and converted it to the more efficient Parquet format, partitioned by the 'lobby_type' column for faster processing. We used Spark for distributed data processing and performed exploratory data analysis to generate insights. We also employed the Mann-Whitney U test to assess the significance of differences in match durations between Radiant and Dire wins.

  Our findings revealed that All Pick, All Draft, and Single Draft were the most popular game modes. Match durations clustered around 40 minutes, though they varied widely. Interestingly, we found a significant difference in match duration depending on the winning side, with Radiant wins tending to result in shorter matches. The timing of the first blood, however, did not appear to affect match duration. Furthermore, we found a pattern wherein the winning team often destroyed most of the opposing team's structures while preserving their own, consistent with the map control and gold advantages that come from taking structures.

  For future work, we recommend expanding the dataset to include more recent matches or different periods. An in-depth analysis of game modes and why they are preferred by players could also be valuable. The observed difference in match durations between Radiant and Dire wins warrants further investigation to determine potential game imbalances. Additionally, exploring the impact of first blood timing on game outcomes and a study of player behavior could provide further insights into the complex dynamics of Dota 2 matches. Through these recommendations, future research can continue to contribute to the game's ongoing development and the enjoyment of its players.

Introduction

 Dota 2, developed and published by Valve Corporation, is a globally celebrated multiplayer online battle arena (MOBA) game. In each match, two teams of five players, known as the Radiant and the Dire, compete to destroy the opposing team's "Ancient", a large structure located within their base. This premise forms the core of the gameplay, though various modes and rulesets exist that add different flavors to it.

 The game features multiple game modes, each providing unique strategic and gameplay variations. Some of these include All Pick, Captains Mode, Random Draft, Single Draft, All Random, and others. These game modes vary in terms of how players select their characters, known as heroes, and the game rules they follow.

 The difference between the Radiant and the Dire isn't merely cosmetic; the game map is asymmetric, meaning each side has unique strategic advantages and disadvantages. The Radiant side is located at the bottom left of the map, while the Dire is at the top right. Different tower placements, jungle layouts, and other map features can influence team strategies and the flow of the game.

 An important event in every Dota 2 match is the 'first blood' - the first kill of the game. This event not only provides a psychological advantage but also grants additional gold to the killer, potentially affecting the early-game dynamic between the teams.

 This report focuses on an exploratory data analysis (EDA) of a dataset containing detailed data from numerous Dota 2 matches. EDA is a critical first step in the data analytics process. By understanding the structure, trends, and outliers in our data, we can better understand the factors that impact a match's outcome and contribute to a player's or team's performance.

Figure 1. Fan-made Labeled Map of Dota 2 lifted from Reddit.

Problem Statement

 While there's a wealth of data available from Dota 2 matches, it has not been extensively leveraged to understand patterns in gameplay and its various dimensions. To address this, we will conduct an in-depth exploratory data analysis on a dataset of Dota 2 matches. This will involve an investigation of various aspects of the game, such as match duration, game mode popularity, the differences between Radiant and Dire victories, and the impact of early game events like first blood on match outcomes.

 Specifically, we aim to answer the following research questions:

  • How many matches were played in each game mode, and which game mode is the most popular?
  • What is the distribution of match durations in the dataset? Are there any outliers or patterns in the data?
  • Is there a relationship between the duration of a match and the type of win (Radiant or Dire)? Do certain types of victories result in longer match durations?
  • What is the relationship between the time of the first kill (first blood) and the duration of the match?
  • What is the distribution of towers and barracks advantage in the dataset? Are there any patterns in the data?

 By answering these questions, we hope to provide gamers, teams, and even tournament organizers with insights that can enhance strategies, improve training methods, and ultimately contribute to a better understanding of the game's dynamics.

Motivation

 The motivation behind this project arises from the ever-evolving dynamics of the Dota 2 game, with each match being unique in its strategies, play style, and outcomes. Despite the complexity and variability, there are underlying patterns and trends in the game data that can provide valuable insights to players, teams, game analysts, and the broader Dota 2 community.

 The exploration of match data, including the popularity of different game modes, the distribution of match durations, and the correlation of early game events with match outcomes, can shed light on effective strategies, common pitfalls, and areas of improvement for players and teams. For instance, understanding the relationship between the duration of a match and the type of win could help teams strategize their game plans better. Moreover, insights from this analysis could help game developers and tournament organizers in designing balanced game modes, creating engaging events, and understanding player behavior better. This, in turn, can lead to a more enjoyable and competitive gaming experience for the players and a more engaging spectacle for the viewers.

 In the broader perspective, this project contributes to the growing field of esports analytics, a discipline that applies data analysis techniques to competitive video gaming. Insights derived from such studies can aid in enhancing player performance, informing team strategies, influencing game development, and even shaping the future of esports broadcasting. Through this project, we aim to transform raw data into meaningful insights, leveraging the power of data analysis to deepen our understanding of Dota 2's complex gameplay dynamics. As such, our exploration serves not only the avid Dota 2 community but also contributes to the larger narrative of data-driven decision-making in esports and beyond.

Data Source

 The dataset used for this analysis is a comprehensive collection of Dota 2 match data, accessible via the Academic Torrents platform, specifically from the OpenDota (formerly YASP) data dumps. OpenDota is an open-source platform that provides Dota 2 related data and analytics [1]. The platform obtains its data through the Dota 2 Application Programming Interface (API), provided by Valve Corporation, which offers match details, player statistics, and other relevant information [2].

 The dataset is rich and multifaceted, containing an array of features that capture various aspects of a Dota 2 match. These include:

Table 1. Data Description of the OpenDota Match dataset

Variable                                       Data Description
match_id, match_seq_num                        Unique identifiers for each match.
radiant_win                                    A boolean value indicating whether the Radiant team won the match.
start_time, duration                           Information about when the match started and how long it lasted.
tower_status_radiant, tower_status_dire        The status of each team's towers at the end of the match.
barracks_status_radiant, barracks_status_dire  The status of each team's barracks at the end of the match.
cluster                                        Identifier for the server that the match was played on.
first_blood_time                               The game time when the first kill occurred.
lobby_type                                     The type of lobby in which the match was played.
human_players                                  The number of human players in the match.
leagueid                                       Identifier for the league in which the match was played.
positive_votes, negative_votes                 The number of upvotes and downvotes the match received from players.
game_mode                                      The mode of the game that was played.
engine                                         The game engine version.
picks_bans                                     Information about the heroes that were picked and banned during the pre-game drafting phase.
parse_status                                   The status of the match data parsing process.
chat                                           The in-game chat messages.
objectives                                     The objectives completed during the match.
radiant_gold_adv, radiant_xp_adv               The gold and experience point advantage for the Radiant team at different points during the match.
teamfights                                     Information about the team fights that occurred during the match.
version                                        The version of the game that was played.
pgroup                                         The player groups in the match.

 This wide variety of data points provides an in-depth snapshot of each match, from the overarching structure to the minute details, offering a unique opportunity to explore and understand the multifaceted dynamics of Dota 2 matches.

Data Exploration

  Investigating the raw data is crucial to understand the scope of our dataset and the potential challenges, such as handling null values, that we might encounter during the data analysis process. The dataset was provided in a compressed .gz file format and contained an extensive amount of match data from Dota 2 games. The data comprised several columns, each representing a different aspect of a match. These columns held different types of data, including integers (for example, 'match_id'), booleans (for example, 'radiant_win'), and arrays (for example, 'picks_bans'). Together they provided a comprehensive overview of each match, from the time the first blood was drawn to the final status of the towers and barracks.

  Upon examination, we found that the match dataset was large - over 140GB - and encompassed a multitude of matches. We also noted that the data contained null values in some columns. For instance, if a match had no recorded teamfights, the 'teamfights' column would be null for that match. Similarly, if a match had no chat records, the 'chat' column would be null. Some features, such as picks_bans, objectives, radiant_gold_adv, and radiant_xp_adv, contain too many missing values (more than half the number of rows), so we exclude them from the succeeding analysis.

In [5]:
!du -sh /mnt/data/public/opendota/matches.gz
146G	/mnt/data/public/opendota/matches.gz
In [6]:
spark = (
    SparkSession
    .builder
    .master('local[*]')
    .config('spark.sql.execution.arrow.pyspark.enabled', 'true')
    .getOrCreate()
)
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/05/15 04:45:22 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
23/05/15 04:45:23 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
In [7]:
df = ps.read_parquet(
    joblib.load('data_filepaths/parquet_filepath.pkl')
)
                                                                                
In [8]:
# Checking null values
df.isna().sum()
23/05/14 22:06:00 WARN package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
                                                                                
Out[8]:
match_id                         0
match_seq_num                    0
radiant_win                      4
start_time                       0
duration                         0
tower_status_radiant             0
tower_status_dire                0
barracks_status_radiant          0
barracks_status_dire             0
cluster                          0
first_blood_time                 0
human_players                    0
leagueid                         0
positive_votes                   0
negative_votes                   0
game_mode                        0
engine                       27805
picks_bans                 1741448
parse_status               1147305
chat                       1377007
objectives                 1377025
radiant_gold_adv           1377027
radiant_xp_adv             1377100
teamfights                 1377153
version                    1377095
pgroup                          61
lobby_type                       0
dtype: int64

Methodology

 The methodology for this project was designed to effectively handle and analyze the large dataset using big data technologies like Apache Spark, leveraging its capabilities for distributed data processing and analysis. The methodology can be broken down into the following steps:

  • Data Extraction. The first step involved extracting a substantial amount of data, at least 16 GB, from the OpenDota data dumps in a format suitable for distributed systems. Given the scale of the data, it was crucial to ensure that the data was structured in a way that would facilitate efficient processing in a distributed environment. We chose the Parquet file format, which is a columnar storage file format optimized for use with big data processing frameworks. Parquet's efficient, per-column compression and encoding schemes result in reduced storage and efficient querying, making it ideal for our use case.
  • Data Loading. Once the data was extracted, we loaded it into Apache Spark, a powerful open-source unified analytics engine for large-scale data processing and analysis. Spark's ability to handle distributed data processing tasks in-memory and in near real-time made it an ideal choice for this project. Using the PySpark API, we were able to interact with Spark using Python, which allowed us to leverage the rich ecosystem of Python libraries for data analysis and visualization while benefiting from Spark's scalability and speed.
  • Exploratory Data Analysis. With the data loaded into Spark, we proceeded with the exploratory data analysis (EDA). This step involved a thorough examination of the dataset using descriptive statistics, data aggregation, and visualization techniques to understand the underlying structure of the data, identify outliers and anomalies, and discover patterns and insights. Key questions we sought to answer included the distribution of matches across different game modes, patterns in match durations, and relationships between match outcomes, match durations, and early game events.
  • Result Summarization and Visualization. Finally, the insights derived from the EDA were summarized and visualized using Python's data visualization libraries. Visual representation of data is a powerful way to communicate the findings as it makes complex data more understandable and actionable. Charts, plots, and other graphical representations were used to illustrate the distribution of data, trends, relationships, and patterns identified in the previous step.


Figure 2. Methodology of the Project

Data Pre-Processing

  Data pre-processing is a crucial step to ensure that the data is in a format that is suitable for analysis and eliminating any potential issues that could affect the results. This step was particularly important in our project given the size and complexity of the Dota 2 match data.

  Our first step in data pre-processing was to extract the first 16GB worth of rows from the matches.gz file and convert them into a CSV file. This allowed us to work with a manageable subset of the data without compromising the diversity and representativeness of the matches included in our analysis.

  Next, we converted the data from CSV format to Parquet format and partitioned the data by the lobby_type column. Parquet is a columnar storage file format that is optimized for use with big data processing tools. By storing data in columns rather than rows, Parquet allows for more efficient I/O operations, more effective compression, and faster query performance, especially for analytical queries that involve a subset of the columns in the data table [3]. Partitioning the data by the lobby_type column further enhanced the efficiency of our data processing tasks, as it allowed us to perform operations on specific subsets of the data without having to scan the entire dataset.
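
  The CSV-to-Parquet conversion described above could be sketched as follows. This is a minimal sketch, not the exact code used in this project: the function name `csv_to_partitioned_parquet`, both paths, and the assumption that the CSV has a header row are illustrative. All columns are read as strings here, which is why numeric columns would need casting afterwards.

```python
def csv_to_partitioned_parquet(spark, csv_path, parquet_path,
                               partition_col="lobby_type"):
    """Read a CSV with Spark and write it as Parquet partitioned by one column.

    `spark` is an existing SparkSession. Both paths are hypothetical
    placeholders, and header=True is an assumption about the CSV layout.
    """
    sdf = spark.read.csv(csv_path, header=True)  # every column read as string
    (
        sdf.write
        .partitionBy(partition_col)  # one sub-directory per distinct value
        .mode("overwrite")           # replace any previous output
        .parquet(parquet_path)
    )
    return parquet_path
```

  Spark writes one `lobby_type=<value>` directory per distinct value, so a later filter on lobby_type only needs to read the matching directories.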

  Finally, we retained only matches occurring after the official release date of Dota 2 (July 9, 2013), and games with 10 human players. This ensured that our analysis was based only on matches played under the final, official rules of the game, excluding any matches from the beta testing phase that might not reflect the current dynamics of the game. Furthermore, by focusing only on games with 10 human players, we eliminated matches with bots, which can behave differently from human players and could therefore skew our analysis. Through these pre-processing steps, we were able to transform our raw data into a clean, structured, and efficient format that facilitated our subsequent exploratory data analysis.
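
  The two retention rules can be illustrated on plain Python records. This is a sketch under stated assumptions: `start_time` is taken to be a Unix timestamp in seconds interpreted as UTC, and the helper name `keep_match` and the three sample records are invented for illustration.

```python
from datetime import datetime, timezone

# Official release date quoted in the text, as a UTC Unix timestamp (assumed UTC).
RELEASE_TS = int(datetime(2013, 7, 9, tzinfo=timezone.utc).timestamp())


def keep_match(start_time, human_players):
    """Return True for post-release matches played by exactly 10 humans."""
    return start_time >= RELEASE_TS and human_players == 10


# Invented sample records, for illustration only.
matches = [
    {"match_id": 1, "start_time": RELEASE_TS - 1, "human_players": 10},  # beta period
    {"match_id": 2, "start_time": RELEASE_TS, "human_players": 10},      # retained
    {"match_id": 3, "start_time": RELEASE_TS + 60, "human_players": 9},  # fewer than 10 humans
]
kept = [m["match_id"] for m in matches
        if keep_match(m["start_time"], m["human_players"])]
print(kept)
```

  Note that the comparison on human_players uses `==`; a single `=` inside a query string would be read as an assignment rather than a test.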

In [8]:
gm = ["unknown", "all_pick", "captains_mode",
      "random_draft", "single_draft", "all_random",
      "intro", "diretide", "reverse_captains_mode",
      "greeviling", "tutorial", "mid_only",
      "least_played", "limited_heroes", "compendium_matchmaking",
      "custom", "captains_draft", "balanced_draft",
      "ability_draft", "event", "all_random_death_match",
      "1v1_mid", "all_draft", "turbo", "mutation",
      "coaches_challenge"]
d_gm = {k: v.replace('_', ' ').title() for k, v in zip(range(26), gm)}

drop_cols = ['pgroup', 'chat', 'objectives', 'version', 'radiant_win',
             'start_year', 'start_time']

df_filtered = (
    df.assign(
        start_time=(ps.to_datetime(df['start_time'], unit='s')),
        win_type=df["radiant_win"].apply(lambda x: 'radiant' if
                                         x == 't' else 'dire'),
        game_mode=df.game_mode.map(d_gm),
        duration=df.duration.astype(int),
        first_blood_time=df.first_blood_time.astype(int),
        duration_mins=df.duration.astype(int) // 60,
        first_blood_mins=df.first_blood_time.astype(int) // 60
    )
    .query('(start_time >= "2013-07-09") and (human_players == 10)')
    .drop(columns=drop_cols)
)
                                                                                

Mann-Whitney U Test

  In our analysis, we also aim to explore whether there is a significant difference between Radiant and Dire wins in terms of certain game variables such as match duration. To test this hypothesis, we used the Mann-Whitney U Test, also known as the Wilcoxon Rank-Sum Test.

  The Mann-Whitney U Test is a non-parametric statistical test that is used to determine if there are differences between two independent groups on a continuous or ordinal variable. It is a non-parametric alternative to the Independent Samples t-test and can be used when the assumptions of the t-test (such as normality and homogeneity of variances) are not met [4].

  The test works by ranking all the values from both groups together and then summing the ranks for each group. The U statistic is then calculated, and based on this value, we determine whether the observed difference between the two groups is statistically significant.

$H_0$: Distribution of underlying sample $X_1$ is the same as the distribution of underlying sample $X_2$
$H_A$: Distribution of underlying sample $X_1$ is NOT the same as the distribution of underlying sample $X_2$

  By applying the Mann-Whitney U Test, we can determine whether any observed differences in these variables between Radiant and Dire wins are due to chance or whether they are statistically significant, thereby providing insights into the dynamics and outcomes of Dota 2 matches.
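
  The ranking procedure described above can be made concrete with a small pure-Python sketch. It only computes the U statistics; the analysis itself relies on `scipy.stats.mannwhitneyu`, which also converts the statistic into a p-value. The helper names and the toy durations are invented for illustration.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of tied values starting at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions are 0-based, ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def mann_whitney_u(x, y):
    """Return (U_x, U_y) for two independent samples."""
    ranks = average_ranks(list(x) + list(y))  # rank both groups together
    n_x, n_y = len(x), len(y)
    rank_sum_x = sum(ranks[:n_x])             # sum of ranks for the first group
    u_x = rank_sum_x - n_x * (n_x + 1) / 2
    u_y = n_x * n_y - u_x                     # the two statistics sum to n_x * n_y
    return u_x, u_y


# Toy durations in minutes, invented purely for illustration.
radiant = [31, 35, 38, 40]
dire = [36, 41, 44, 47]
print(mann_whitney_u(radiant, dire))
```

  A U value far from half of n_x * n_y, as in this toy example, indicates that one group's values tend to rank below the other's.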

Results and Discussions

  The questions posed in this analysis aim to illuminate key aspects of Dota 2 gameplay. Each question focuses on a specific facet of the game, and collectively, they help us understand the overall dynamics, patterns, and strategies employed in Dota 2 matches. Answering these questions also contributes to a richer understanding of Dota 2's complex gameplay dynamics and strategies, helping to enhance the game experience for all stakeholders, from players and teams to game developers and spectators.

How many matches were played in each game mode? Which game mode is the most popular?

  Understanding the distribution of matches across game modes and identifying the most popular game mode is crucial for several reasons. First, it can inform game developers about players' preferences, guiding updates and changes to the game. For esports organizations, understanding which game modes are most commonly played can help design tournaments and competitions that attract a wider audience. Finally, for players and teams, knowing the most popular modes can help focus their practice and strategic preparations.

In [9]:
plotly.offline.init_notebook_mode()
In [ ]:
fig_gm = (df_filtered.groupby('game_mode').match_id.count().sort_values()
          .plot.bar(orientation='h'))

(
    fig_gm
    .update_layout(template="plotly_white",
                   title="Dota 2's three most popular game modes are All Pick, All Draft, and Single Draft",
                   xaxis_title="Number of Matches",
                   yaxis_title="Game Mode", showlegend=False)
    .update_traces(marker_color='red')
)

fig_gm.show(config={
    "editable": True,
    'toImageButtonOptions': {
        'format': 'png',  # one of png, svg, jpeg, webp
        'filename': 'fig_gm',
        'scale': 5  # Multiply title/legend/axis/canvas sizes by this factor
    }
})

Figure 3. Number of Matches by Game Mode

  Figure 3 shows that the three most popular game modes are All Pick, All Draft, and Single Draft. Several factors could explain this. All Pick, where players can choose any hero without restrictions, offers the most flexibility and lets players of any skill level try whichever hero they like. All Draft and Single Draft, where players have to strategize and make the best use of a limited pool of heroes, may appeal through the strategic depth they offer and as a change of pace when All Pick gets stale. These findings can help players decide where to focus their practice and help game developers see which modes attract the most players. Developers may also use them to decide which game modes to improve further or to build limited-time events around. For example, in-game events with exclusive rewards in the top three game modes would likely draw substantial playtime.

What is the distribution of match durations in the dataset? Are there any outliers or patterns in the data?

  Analyzing the distribution of match durations can provide insights into the pacing of the game. Outliers could suggest particularly intense or strategically interesting matches that might be worth a detailed analysis for players seeking to improve their skills. Understanding the typical match duration can also help players manage their time effectively. From a game development perspective, trends in match duration can provide feedback on game balance and pacing.

In [ ]:
fig_duration = (df_filtered.duration_mins
                .plot.hist(bins=20, alpha=0.5, color='black'))

(
    fig_duration
    .update_layout(
        template="plotly_white",
        title="Majority of Dota 2 matches last around 40 mins",
        xaxis_title="Match Duration (minutes)",
        yaxis_title="No. of Matches"
    )
    .update_traces(marker_color='black')
)

fig_duration.show(config={
    "editable": True,
    'toImageButtonOptions': {
        'format': 'png',  # one of png, svg, jpeg, webp
        'filename': 'fig_duration',
        'scale': 5  # Multiply title/legend/axis/canvas sizes by this factor
    }
})

Figure 4. Distribution of Dota 2 Match Durations

  Figure 4 shows that match durations cluster around 40 minutes. This suggests that Dota 2 matches tend to be relatively long, reflecting the complex and strategic nature of the game and the commitment it requires from the player. Around this typical value, match duration still varies widely, influenced by factors such as the skill level of the players, the strategies employed, and the specific heroes chosen. This information is useful for players planning their game sessions and for spectators setting aside time to watch matches.
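
  One common way to flag duration outliers, which the histogram alone does not do, is the 1.5 x IQR rule. The sketch below is not part of the original analysis; the helper name and the sample durations are invented for illustration. It also shows how a single very long match pulls the mean upward while the median stays near the bulk of the data.

```python
import statistics


def iqr_outlier_bounds(values):
    """Return (lower, upper) fences at 1.5 * IQR beyond the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


# Toy durations in minutes, invented for illustration only.
durations = [28, 33, 36, 38, 40, 41, 43, 46, 52, 118]
low, high = iqr_outlier_bounds(durations)
outliers = [d for d in durations if d < low or d > high]

print("mean:", statistics.mean(durations))
print("median:", statistics.median(durations))
print("outliers:", outliers)
```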

Is there a relationship between the duration of a match and the type of win? Do radiant or dire wins result in longer match durations?

  Investigating the relationship between match duration and win type can reveal potential imbalances in the game. For instance, if Radiant wins consistently occur in shorter matches, it might suggest an inherent advantage for the Radiant side. These insights can be vital for game developers striving for balanced gameplay. For players and teams, such information can influence their strategies depending on whether they play on the Radiant or Dire side.

  In the context of this report, we used the Mann-Whitney U Test to compare duration of matches won by the Radiant team vs those won by the Dire team. In this case our null and alternative hypotheses are defined as follows:

$H_0$: The distribution of match durations for Radiant wins is the same as that for Dire wins
$H_A$: The distribution of match durations for Radiant wins is NOT the same as that for Dire wins

In [ ]:
fig_dur_win = go.Figure()

# Select the duration column before converting, so that only one column
# (not the whole frame) is collected to the driver.
fig_dur_win.add_trace(
    go.Histogram(x=(df_filtered[df_filtered.win_type == 'radiant']
                    .duration_mins.to_pandas()),
                 name="Radiant Win",
                 nbinsx=20,
                 marker_color='black')
)

fig_dur_win.add_trace(
    go.Histogram(x=(df_filtered[df_filtered.win_type == 'dire']
                    .duration_mins.to_pandas()),
                 name="Dire Win",
                 nbinsx=20,
                 marker_color='red')
)

(
    fig_dur_win.update_layout(template="plotly_white",
                              xaxis_title="Match Duration (minutes)",
                              yaxis_title="No. of Matches",
                              title="Distribution shift of match durations of radiant-win vs dire-win matches")
)

fig_dur_win.show(config={
    "editable": True,
    'toImageButtonOptions': {
        'format': 'png',  # one of png, svg, jpeg, webp
        'filename': 'fig_dur_win',
        'scale': 5  # Multiply title/legend/axis/canvas sizes by this factor
    }
})

Figure 5. Distribution of Radiant Win Durations and Dire Win Durations

In [ ]:
display(
    mannwhitneyu(
        df_filtered[df_filtered.win_type == 'radiant']
        .duration_mins.to_numpy(),
        df_filtered[df_filtered.win_type == 'dire'].duration_mins.to_numpy(),
        alternative='two-sided')
)

MannwhitneyuResult(statistic=290826606793.5, pvalue=0.0)

  Figure 5 shows a shift between the distributions of match durations for Radiant-win and Dire-win matches: on average, Radiant wins appear to come faster. The Mann-Whitney U test corroborates this. The reported p-value of 0.0 (effectively zero at floating-point precision) is below $\alpha=0.05$, so we reject $H_0$ and conclude that there is a significant difference in duration between Radiant wins and Dire wins. This could suggest a potential imbalance in the game, with the Radiant side possibly having certain advantages that allow them to win matches more quickly. This information is particularly valuable for game developers seeking to improve game balance and for players and teams looking to refine their strategies based on their starting side.

What is the relationship between the first blood time and the duration of the match? Do matches with earlier first blood tend to result in shorter matches?

  First blood, or the first kill of the game, is often seen as setting the tone for the rest of the match. By studying the relationship between the timing of first blood and the duration of the match, we can understand whether early aggression leads to shorter matches. These insights can significantly impact player strategies. For instance, if earlier first blood correlates with shorter matches, teams might prioritize early aggression to gain an advantage. Understanding this relationship also adds another dimension to the spectating experience, as fans and commentators can use this information to make predictions and analyze gameplay.

In [ ]:
fig_scatter = df_filtered.plot.scatter(x='first_blood_time', y='duration')

(
    fig_scatter.update_layout(template="plotly_white",
                              title="Timing of the First Blood does not dictate the match duration",
                              xaxis_title="First Blood Time (seconds)",
                              yaxis_title="Match Duration (seconds)",
                              showlegend=False)
    .update_traces(marker_color='red')
)

fig_scatter.show(config={
    "editable": True,
    'toImageButtonOptions': {
        'format': 'png',  # one of png, svg, jpeg, webp
        'filename': 'fig_scatter',
        'scale': 5  # Multiply title/legend/axis/canvas sizes by this factor
    }
})

Figure 6. Scatter Plot of First Blood Time against Match Duration

  The resulting plot is dense at lower first blood times. This is consistent with domain knowledge: lower-skilled players are more likely to make early mistakes that lead to a death. We observe a few outliers, plausibly corresponding to higher-skilled matches where securing the first kill takes a while. Contrary to what one might expect, an earlier first blood does not appear to make the game shorter. This suggests that while getting the first kill might give a team an early advantage, it has little bearing on the overall duration of the match. This could be due to the game's balanced design, where teams have opportunities to recover and stage a comeback even after losing the first blood. This finding could encourage teams not to be disheartened if they lose the first blood, as it does not appear to dictate the overall length of the match.
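
  The visual impression from the scatter plot could be quantified with a correlation coefficient. The sketch below computes Pearson's r in pure Python; it was not part of the original analysis, and the sample values are invented for illustration. On the real data, the `corr` method of pandas-on-Spark DataFrames could serve the same purpose.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Toy values in seconds, invented for illustration only.
first_blood = [30, 60, 90, 120, 150]
duration = [2400, 2100, 2700, 2300, 2500]
print(round(pearson_r(first_blood, duration), 3))
```

  A value of r near zero would support the reading that first blood timing and match duration are at most weakly related, while values near 1 or -1 would indicate a strong linear relationship.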

What is the distribution of towers and barracks advantage in the dataset? Are there any patterns in the data?

  Towers are defensive structures located along each lane of the Dota 2 map. Towers attack enemy units that come within their range, providing a defensive barrier for the team that controls them. Destroying enemy towers grants gold and map control advantages to the team that destroys them. Barracks, on the other hand, are buildings situated in the base area of each team, two for each lane. Destroying enemy barracks weakens the creeps that the enemy spawns in that lane, making it easier to push and apply pressure on the enemy team's base. Understanding the distribution of tower and barracks advantage allows us to assess the strategic impact of these objectives on gameplay outcomes. The tower (barracks) advantage was obtained by counting the towers (barracks) the Radiant team still has standing at the end of the match and subtracting the number the Dire team still has standing. Positive values indicate that Radiant has the advantage, while negative values indicate that Dire has the advantage.
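  As a minimal sketch of the computation described above, assume that each status value is an integer bitmask in which a set bit marks a building that is still standing (our reading of the OpenDota field description [2]). The helper names and the sample values below are hypothetical and only illustrate the arithmetic.

```python
def standing_count(status: int) -> int:
    """Count the set bits, i.e. the buildings assumed to be still standing."""
    return bin(status).count("1")


def radiant_advantage(radiant_status: int, dire_status: int) -> int:
    """Positive when Radiant has more buildings left, negative for Dire."""
    return standing_count(radiant_status) - standing_count(dire_status)


# Hypothetical end-of-match values: Radiant keeps all 11 towers, Dire keeps 3.
radiant_towers = 0b11111111111  # 2047
dire_towers = 0b00000000111     # 7

print(radiant_advantage(radiant_towers, dire_towers))  # 8
```

  The same helper applies to the barracks columns, where at most six bits per team would be set under this assumption.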

In [ ]:
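# Each status column is treated as a bitmask: the number of set bits is the
# number of buildings still standing (assumed encoding, see [2]).
# Note: bin(x).replace('0', '') also strips the '0' of the '0b' prefix, so the
# earlier "len(...) - 2" count was off by one on each side. The difference was
# still right, but counting '1' characters directly is clearer and exact.
def count_standing(status):
    return bin(status).count('1')


df_filtered = df_filtered.assign(
    tower_adv_radiant=(df_filtered.tower_status_radiant.astype(int)
                       .apply(count_standing)
                       - df_filtered.tower_status_dire.astype(int)
                       .apply(count_standing)),
    barracks_adv_radiant=(df_filtered.barracks_status_radiant.astype(int)
                          .apply(count_standing)
                          - df_filtered.barracks_status_dire.astype(int)
                          .apply(count_standing))
)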

fig_tower_adv = go.Figure()

fig_tower_adv.add_trace(
    go.Histogram(x=(df_filtered[df_filtered.win_type == 'radiant']
                    .to_pandas().tower_adv_radiant),
                 name="Radiant Win",
                 nbinsx=20,
                 marker_color='red')
)

fig_tower_adv.add_trace(
    go.Histogram(x=(df_filtered[df_filtered.win_type == 'dire']
                    .to_pandas().tower_adv_radiant),
                 name="Dire Win",
                 nbinsx=20,
                 marker_color='black')
)

(
    fig_tower_adv.update_layout(template="plotly_white",
                                title="Distribution of Tower Advantage of Radiant vs Dire Wins",
                                xaxis_title="Tower Advantage",
                                yaxis_title="No. of Matches")
)

fig_tower_adv.show(config={
    "editable": True,
    'toImageButtonOptions': {
        'format': 'png',  # one of png, svg, jpeg, webp
        'filename': 'fig_tower_adv',
        'scale': 5  # Multiply title/legend/axis/canvas sizes by this factor
    }
})

Figure 6. Distribution of Tower Advantage of Radiant

In [ ]:
fig_barracks_adv = go.Figure()

fig_barracks_adv.add_trace(
    go.Histogram(x=(df_filtered[df_filtered.win_type == 'radiant']
                    .to_pandas().barracks_adv_radiant),
                 name="Radiant Win",
                 nbinsx=20,
                 marker_color='red')
)

fig_barracks_adv.add_trace(
    go.Histogram(x=(df_filtered[df_filtered.win_type == 'dire']
                    .to_pandas().barracks_adv_radiant),
                 name="Dire Win",
                 nbinsx=20,
                 marker_color='black')
)

(
    fig_barracks_adv.update_layout(template="plotly_white",
                                   title="Distribution of Barracks Advantage of Radiant vs Dire Wins",
                                   xaxis_title="Barracks Advantage",
                                   yaxis_title="No. of Matches")
)

fig_barracks_adv.show(config={
    "editable": True,
    'toImageButtonOptions': {
        'format': 'png',  # one of png, svg, jpeg, webp
        'filename': 'fig_barracks_adv',
        'scale': 5  # Multiply title/legend/axis/canvas sizes by this factor
    }
})

Figure 7. Distribution of Barracks Advantage of Radiant

  The distributions in Figure 6 and Figure 7 show a clear trend: the winning team tends to destroy most of the opposing team's towers and barracks while preserving its own structures. Destroying towers brings improved map control and more gold, which further widens the winning team's lead. Demolishing the enemy's barracks in turn weakens their lane creeps, making it easier for the winning team to apply pressure and assault the opponent's base. This creates opportunities to breach defenses and ultimately destroy the enemy's Ancient, securing victory. If beginner players outnumber seasoned veterans, as domain knowledge suggests, the team that destroys towers first likely gains an early advantage that leads to a "snowball" effect, whereby its advantages compound as the match progresses. Less experienced players would find it hard to mount a comeback from such a disadvantaged position.
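  One way to put a number on the pattern in Figures 6 and 7 is the share of matches in which the winner also holds the building advantage. The sketch below is an illustration under stated assumptions, not a result from the dataset: the `(win_type, tower_adv_radiant)` pairs are invented, and `winner_holds_advantage` is a hypothetical helper.

```python
def winner_holds_advantage(win_type: str, radiant_adv: int) -> bool:
    """True when the sign of the Radiant advantage matches the winning side."""
    if win_type == "radiant":
        return radiant_adv > 0
    return radiant_adv < 0


# Invented (win_type, tower_adv_radiant) pairs, for illustration only.
matches = [
    ("radiant", 8),
    ("radiant", 5),
    ("dire", -7),
    ("dire", 2),
    ("radiant", -1),
    ("dire", -9),
]

share = sum(winner_holds_advantage(w, a) for w, a in matches) / len(matches)
print(f"Winner holds the tower advantage in {share:.0%} of matches")
```

  Matches with an advantage of exactly zero count as disagreement here; whether to exclude them instead is a design choice.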

Conclusion and Recommendations

  This study utilized big data techniques to perform an exploratory data analysis on a dataset from Dota 2 matches. We discovered important insights into various aspects of the game, such as the distribution of matches across game modes, the average duration of matches, the relationship between the type of win and match duration, and the impact of first blood timing on the duration of matches.

  Our findings highlighted the popularity of All Pick, All Draft, and Single Draft game modes, suggesting that these modes offer the strategic depth and flexibility that Dota 2 players prefer. We also found that the typical Dota 2 match lasts around 40 minutes, though durations vary widely from match to match. Interestingly, the Mann-Whitney U test indicated a statistically significant difference in match duration between Radiant and Dire wins, with Radiant wins generally resulting in shorter matches. This could point to potential imbalances in the game that might be of interest to both players and game developers.

  Finally, our study showed that the timing of first blood does not appear to dictate the overall length of the match, which is consistent with a balanced game in which teams have opportunities to recover from early setbacks. In contrast, end-of-match tower and barracks advantages are extreme in most matches and favor the winning side, suggesting that structures are a crucial factor in winning a match. Players may therefore benefit from prioritizing map control through tower destruction over chasing kills.

Recommendations

  • Expand the dataset. While our study used a significant portion of the available data, expanding the dataset to include more recent matches or matches from different periods could provide further insights. It would also be interesting to consider other variables not included in this analysis, such as individual player performance metrics.
  • In-depth analysis of game modes. Our findings revealed the most popular game modes, but a deeper dive into the characteristics of these game modes and why they are preferred by players could help make the unpopular game modes more palatable to players.
  • Investigate the Radiant/Dire imbalance. The observed difference in match durations between Radiant and Dire wins warrants further investigation. Future work could explore this in more depth, perhaps by looking at the specific strategies or heroes that contribute to this disparity.
  • Player behavior analysis. This study focused on game outcomes and mechanics. Another interesting area of study could be player behavior, such as communication patterns, cooperation, and player roles, which could also significantly influence match outcomes.

References

[1] OpenDota (formerly YASP) Data Dumps. (2019, November 30). Academic Torrents. https://academictorrents.com/collection/opendota-formerly-yasp-data-dumps

[2] OpenDota Matches. (2023, May 15). OpenDota API. https://docs.opendota.com/#tag/matches%2Fpaths%2F~1matches~1%7Bmatch_id%7D%2Fget

[3] What is Parquet? (2023, May 15). Databricks. https://www.databricks.com/glossary/what-is-parquet

[4] McKnight, P. E., & Najab, J. (2010). Mann‐Whitney U Test. The Corsini encyclopedia of psychology, 1-1.